Lab 13 assignment

Text processing and visualization

We'll use the Python package wordcloud to make word clouds. To install it, type the following in your terminal:

pip install wordcloud

If you don't have pip (or aren't familiar with it), you can install manually:

  1. Download the package from https://github.com/amueller/word_cloud/archive/master.zip and unzip the file.

  2. In your terminal, go to that folder (cd word_cloud-master)

  3. Install the requirements:

sudo pip install -r requirements.txt

  4. Install the package:

python setup.py install
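
Either way, you can quickly check that the installation worked by importing the package from the command line:

python -c "import wordcloud"

If this exits without an error, you're good to go.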


In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import nltk
from wordcloud import WordCloud
import pandas as pd
import numpy as np

Let's go back to the IMDB dataset. We'll try to extract some meaningful keywords in the movie titles and make a word cloud with them.


In [2]:
df = pd.read_csv('../imdb.csv', sep = '\t')
df.head(2)


Out[2]:
       Title  Year  Rating  Votes
0     !Next?  1994     5.4      5
1  #1 Single  2006     6.1     61

The wordcloud package takes a string as its input, so we'll join all the titles into one big string.


In [3]:
text = ' '.join([i for i in df.Title.tolist()])

In [4]:
text[0:30]


Out[4]:
'!Next? #1 Single #7DaysLater #'

In [5]:
wordcloud = WordCloud(max_font_size=40, max_words = 200, background_color="white").generate(text)

Note that you can tune these parameters; the full list of options is described in the wordcloud documentation.
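
For example, here is a sketch of a few commonly used options (the values below are purely illustrative, and the exact parameter set may vary with your wordcloud version):

wc_custom = WordCloud(width=800, height=400,       # size of the rendered image in pixels
                      max_words=100,               # keep only the 100 most frequent words
                      max_font_size=60,            # cap the largest font size
                      background_color="black",    # canvas colour
                      colormap="viridis",          # any matplotlib colormap name
                      random_state=42              # fix the layout so re-runs look identical
                      ).generate(text)
plt.imshow(wc_custom)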


In [6]:
plt.figure(figsize = (10,7))
plt.imshow(wordcloud)


Out[6]:
<matplotlib.image.AxesImage at 0x1f60ea16eb8>

I'll make a function because we'll do this multiple times.


In [7]:
def draw_wordcloud(s):
    wordcloud = WordCloud(max_font_size=40, max_words = 200, background_color="white").generate(s)
    plt.figure(figsize = (10,7))
    plt.imshow(wordcloud)
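
One small cosmetic tweak you may want is to hide the axis ticks so only the cloud itself is shown, for example by adding plt.axis("off") at the end of the function:

def draw_wordcloud(s):
    wordcloud = WordCloud(max_font_size=40, max_words=200, background_color="white").generate(s)
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud)
    plt.axis("off")   # hide the tick marks and frame around the image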

What do you think about the word cloud? Are the words really meaningful?

As we discussed in class, the word cloud's quality is mostly determined by the quality of the text that we give it. We can do some text processing to improve the text's quality. NLTK is a classic Python package for such tasks.

The first step of any text processing is usually to tokenize the string: that is, break down the movie titles into words.


In [8]:
from nltk import word_tokenize
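
If this is your first time using NLTK, the tokenizer may complain about missing data; in that case, download the 'punkt' models once:

import nltk
nltk.download('punkt')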

In [9]:
tokens = word_tokenize(text)

In [10]:
tokens[0:10]


Out[10]:
['!', 'Next', '?', '#', '1', 'Single', '#', '7DaysLater', '#', 'Bikerlive']

There are a lot of non-word symbols. Let's get rid of them.

TODO: remove the numbers and symbols from the tokens. Hint: you can use Python's str.isalpha() method. After removal, it should look like the following:


In [11]:
tokens = [word for word in tokens if word.isalpha()]   # keep only purely alphabetic tokens

In [12]:
tokens[0:10]


Out[12]:
['Next',
 'Single',
 'Bikerlive',
 'ByMySide',
 'LawstinWoods',
 'lovemilla',
 'nitTWITS',
 'My',
 'Dad',
 'Says']

Some of these words, like "My", are not really meaningful: these are the so-called stopwords. We can pick them out and remove them. NLTK provides a list of common stop words.


In [13]:
from nltk.corpus import stopwords
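
As with the tokenizer, the stopword list is part of NLTK's downloadable data; if it's missing, fetch it once:

import nltk
nltk.download('stopwords')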

In [14]:
st = stopwords.words('english')
st[0:5]


Out[14]:
['i', 'me', 'my', 'myself', 'we']

In [15]:
#TODO: remove stop words in the tokens.

In [16]:
tokens = [word for word in tokens if word not in st]   # drop tokens that appear in the stopword list

In [17]:
tokens[0:10]


Out[17]:
['Next',
 'Single',
 'Bikerlive',
 'ByMySide',
 'LawstinWoods',
 'lovemilla',
 'nitTWITS',
 'My',
 'Dad',
 'Says']

Note that "My" is still in the list: the NLTK stopword list is all lowercase, so the capitalized "My" doesn't match it. We'll deal with letter case a bit later. For now, let's try the word cloud again.


In [18]:
text = ' '.join([i for i in tokens])

In [19]:
draw_wordcloud(text)


Maybe a little better, but not much?

We can also add our own words to the stopword list. In our case, we probably don't want words like "One" in the picture.


In [20]:
st.append("one")

In [21]:
#TODO: add other words that are not meaningful and re-create the word cloud.
st.append("de")
st.append("di")
text = ' '.join([i for i in tokens])
draw_wordcloud(text)


There are still a lot of words that we probably can't make sense of. Some of these, like "Der" and "El", are likely non-English words.

To remove non-English words, a simple approach is to check each word against an English dictionary. One source of a collection of common English words is here; you can download and use it.


In [22]:
english = [line.strip() for line in open('wordsEn.txt')]
english = set(english)
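
If you don't have the wordsEn.txt file handy, NLTK also ships an English word list that can serve the same purpose (it needs a one-time download, and it's a different list, so your results may differ slightly):

import nltk
nltk.download('words')
from nltk.corpus import words
english = set(w.lower() for w in words.words())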

TODO: remove non-English words. Hint: you likely want to convert the English word list to a set with set() to speed up the membership test.


In [23]:
tokens = [word for word in tokens if word.lower() in english]   # keep only words found in the English word list
text = ' '.join([i for i in tokens])
draw_wordcloud(text)


It removes some non-English words, but others are still there. This may be because they also happen to appear in the English vocabulary for one reason or another.

We may have to remove these manually by adding them to the stopword list.


In [24]:
st.append("De")
st.append("Da")
st.append("Lo")
st.append("Le")
st.append("Da")
st.append("El")
st.append("du")
st.append("la")
st.append("au")

In [25]:
tokens = [word.lower() for word in tokens]   # lowercase the tokens...
st = [word.lower() for word in st]           # ...and the stopword list, so the comparison ignores case

In [26]:
#TODO: remove the other non-English words and make a word cloud.
tokens = [word for word in tokens if word not in st]
text = ' '.join([i for i in tokens])
draw_wordcloud(text)



In [27]:
# keep a copy of the cleaned tokens so we can reuse them when tuning TF-IDF below
saved_tokens = list(tokens)

Now we finally have a plot where we can make sense of most of the words. Still, it doesn't seem very informative.

The most frequent words in a collection of texts may not be the most meaningful ones. A popular way to address this is TF-IDF (term frequency-inverse document frequency). In short, it is a method for scoring how "important" a word is to a document relative to the rest of the collection. We can pick out the most important words and draw a word cloud with them. sklearn has an implementation of it.


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
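
To get a feel for what the vectorizer computes, here is a tiny made-up example using the TfidfVectorizer imported above (the three "documents" are invented purely for illustration):

toy_docs = ["the cat sat", "the dog sat", "the cat ran"]
vec = TfidfVectorizer()
scores = vec.fit_transform(toy_docs)   # one row of TF-IDF scores per document
print(vec.get_feature_names())         # the learned vocabulary (newer sklearn versions: get_feature_names_out())
print(scores.toarray().round(2))       # "the" appears in every document, so it gets the lowest weight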

TF-IDF works on a collection of documents. Here I'll just divide the tokens into short sublists and treat each sublist as a tiny document.


In [29]:
# np.array_split (unlike np.split) also works when the length is not evenly divisible by 3
text = np.array_split(np.asarray(tokens), len(tokens) // 3)
text = [' '.join([j for j in i.tolist()]) for i in text]

In [30]:
vectorizer = TfidfVectorizer(max_features=1000)
vectorizer.fit_transform(text)              # we only need the fitted vocabulary, so the returned matrix isn't stored
tokens = vectorizer.get_feature_names()     # the terms the vectorizer kept

In [31]:
text = ' '.join([i for i in tokens])
draw_wordcloud(text)


By setting max_features=1000, I'm limiting the vectorizer's vocabulary to the 1000 most frequent terms across these mini-documents, and get_feature_names() returns exactly those terms. You can change this parameter and see how it influences the output.


In [32]:
#TODO: tune the max_features parameter and make another word cloud.
tokens = list(saved_tokens)   # start again from the cleaned tokens saved earlier
text = np.array_split(np.asarray(tokens), len(tokens) // 3)
text = [' '.join([j for j in i.tolist()]) for i in text]

In [33]:
vectorizer = TfidfVectorizer(max_features=500)
vectorizer.fit_transform(text)
tokens = vectorizer.get_feature_names()

In [34]:
text = ' '.join([i for i in tokens])
draw_wordcloud(text)


